Abstract

Determining the interaction partners among protein/domain families poses hard computational problems,
in particular in the presence of paralogous proteins. Available approaches aim to identify interaction
partners among protein/domain families by maximizing the similarity between trimmed versions of their
phylogenetic trees. Since maximization of any natural similarity score is computationally difficult, many
approaches employ heuristics to maximize the similarity between the distance matrices corresponding to
the tree topologies in question. In this paper we devise an efficient deterministic algorithm which directly
maximizes the similarity between two leaf-labeled trees with edge lengths, obtaining a score-optimal
alignment of the two trees in question.

Our algorithm is significantly faster than methods based on distance matrix comparison: 1 minute on
a single processor vs. 730 hours on a supercomputer. Furthermore, we outperform the current
state-of-the-art exhaustive search approach in terms of precision as well as a recently suggested overall
performance measure for mirrortree approaches, while incurring acceptable losses in recall.

A C implementation of the method demonstrated in this paper is available at
http://compbio.cs.sfu.ca/mirrort.htm

1 Introduction 

The vast majority of cellular functions are exerted by (combinations of) interacting gene products. As a
result, "preservation of functionality" among proteins and other gene products typically implies
"preservation of interactions" across species. It is well established that protein-protein interactions (both
physical interactions as well as co-occurrence of domains) are preserved through speciation events (see
[6, 9] and the references therein). A major implication of this is that the evolutionary trees behind two
interacting protein families can look near-identical.

As interacting proteins have a tendency to co-evolve, it is desirable to "measure" how similarly two 
or more proteins (or other gene products) evolve to assess their possibility of being interaction partners. 
For that purpose a number of computational strategies have been developed to compare the phylogenetic 
trees that represent two or more protein or protein-domain families. Among these strategies we will focus
on the mirrortree approach, where the phylogenetic trees of protein or protein-domain families are called 
gene trees: here leaves represent "homologs" and internal vertices represent either speciation or duplication
events. There are a number of mirrortree methods described in the literature, each of which is based on a
specific measure of pairwise tree similarity and an algorithm to compute it; see the introductory paper by
[6] and [9] for more references.

In the context of mirrortree approaches, direct comparison of gene trees is considered to be "... a problem 
yet to be fully resolved." S p. 2], and thus available techniques typically "measure" tree similarity in terms 
of the similarity between their "distance matrices"; the distance matrix of a gene tree is defined so that the 
entry (i, j) represents the distance between vertices i and j on the tree. Similarity between distance matrices 
of two trees can easily be computed and may be used to accurately estimate the similarity between the 
corresponding gene trees [9] in the absence of paralogous proteins. This is due to the fact that the absence
of paralogs implies a bijection between the leaves of the two trees compared (i.e. there is exactly one vertex
for each species in each gene tree). In the presence of paralogous proteins, however, one needs to determine 
the correct "pairing" of leaves so as to assess the "degree" of co-evolution among the two families. Note 
that it is not trivial to establish such a mapping: as pointed out in [16], protein interaction can be preserved
during duplication, while interaction can be lost during speciation. 

There are a number of mirrortree approaches for determining the exact correspondence between the
leaves of two gene trees; typically these approaches aim to "align" the distance matrices by shuffling and
eliminating the rows (and corresponding columns) so as to maximize the similarity between the matrices.^1
The similarity between two aligned matrices is defined in the form of root mean square difference [12],
correlation coefficient [3], information-theoretic 'total interdependency' of multiple alignments [16], Student's
t [4] or the size of the largest common submatrix [17]. Because an exact solution to the matrix alignment
problem (where the goal is to maximize any of these notions of similarity) is hard to compute, many
available approaches employ heuristics based on swapping pairs of rows/columns in a greedy fashion. These
methods also commonly perform column/row elimination from the "larger" matrix only, and not the other
[3, 4, 12, 16]. We are aware of one exception [17], which aims to determine the largest common
(i.e. within a threshold) submatrix and removes the remainder of the columns and rows from both matrices.
Similarly, the only approach which directly compares the tree topologies themselves is [4], which uses a
Metropolis algorithm to heuristically traverse 'tree automorphism' space. However, this approach cannot
handle trees of different sizes. See [6], [9], [17] for references on mirrortree approaches which do not
necessarily relate to the mapping problem.

Our Approach: Modeling and Formalization. In this paper we present polynomial-time algorithms that 
determine mappings of leaves which respect the topology of their two gene trees. As input, we are given two 
"gene trees" T and T' of two protein/domain families known to interact with one another. T and T' have 
labeled leaves where labels reflect species such that the presence of the same label at two different leaves 
reflects the presence of paralogs. We then delete both leaves and inner vertices from both trees until the
remaining trees are isomorphic, that is, one can map the vertices of the two remaining trees in a one-to-one
fashion onto one another such that ancestor relationships are preserved. This in particular implies a
one-to-one mapping of the remaining leaves, which we present as output. Clearly, there are many different possible
choices of such one-to-one mappings of leaves — our algorithms determine the score-optimal such mapping 
where different deletion operations are penalized in different ways, depending on how they transform the 
topologies of the trees. We describe the nature of our scoring scheme in a little more detail in the following; 
please see the Methods section for full details and precise notations. 

^1 Note that mirrortree approaches differ from approaches that aim to reconcile evolutionary trees into a single summary [?].

We denote a bijection (i.e. a one-to-one and onto mapping) between subsets of the vertices of $T$ and $T'$ by
$\mathcal{M}[T, T']$ and write

$$\mathcal{M} := \{(v, w) \in T \times T' \mid \mathcal{M}(v) = w\} \qquad (1)$$

for the pairs of mapped vertices. Note that in such a bijection, not all vertices of $T$ are necessarily mapped
to a vertex in $T'$ and vice versa. We refer to vertices which are not mapped as deleted by $\mathcal{M}[T, T']$. We
only consider mappings which satisfy the following: (1) the mapping preserves the ancestor relationships of
$T$ and $T'$; (2) only leaves with identical labels are mapped onto one another; (3) upon deletion of vertices,
where deletion of an internal vertex $v$ leads to new edges joining the parent of $v$ with the children of $v$, the
two tree topologies are isomorphic. Among the mappings satisfying the above conditions, we compute the
mapping that has maximum score.

For a formal definition of our scoring scheme, consider the internal vertices of $T$ and $T'$ that are deleted.
Among them, we single out the vertices $v$ that have descendants $x$ which are not deleted, and write
$N_I$ for the set of such vertices. We write $N_T$ for the remaining deleted vertices. Note that each vertex
$v \in N_T$ is part of a subtree of $T$ (or $T'$) which has been deleted as a whole. The score of the mapping is then defined as

$$S(\mathcal{M}[T, T']) = \sum_{(v, v') \in \mathcal{M}} S_M(v, v') + \sum_{v \in N_I} S_{N_I}(v) + \sum_{v \in N_T} S_{N_T}(v). \qquad (2)$$

The individual score functions $S_M$, $S_{N_I}$ and $S_{N_T}$ will be formally defined in the Methods section. Our
algorithm, which maximizes the overall score of the mapping, can be viewed as an extension of the standard 
tree edit distance algorithm for unweighted trees (e.g. O), to those with edge weights. Determining the 
tree edit distance is NP-complete iflTl (in fact MAX-SNP-hard ll20l ). Since the instances treated here are 
too large (trees have up to more than 200 leaves) we have to impose reasonable constraints when aiming at 
fast, polynomial-runtime solutions. Motivated by test runs (see numbers referring to Ci^2,3 in the Results 
and Discussion section), we chose to impose the additional constraint that a vertex $u$ and its parent $v$ cannot
both be deleted unless the entire subtree rooted at $v$ is deleted. That is, we disallow having
both a parent $v$ and a child $u$ in $N_I$. Note, however, that deletion of two internal siblings is permissible; we
found that such deletions can lead to favorable mappings. As the operation of deleting entire subtrees does
not lead to runtime issues, does not perturb the topology of the remaining trees and also reflects the
biologically reasonable assumption that interaction can be lost for entire subtrees, we allow it without additional
restrictions.

Note that the algorithm only outputs one uniquely determined, score-optimal mapping of subsets of
leaves of $T, T'$. Note further that we do not perform an exhaustive search, since we never consider
mappings of leaves which imply mappings of internal vertices that do not preserve ancestor relationships of
the gene trees $T, T'$ and thereby contradict their topologies.

Alternative constraints leading to polynomial-time solvable variants of the tree edit distance are surveyed
in [19]. For further, more recent work see also [11], which addresses the subtree homeomorphism problem:
given a "text" tree $T$ and a "pattern" tree $P$ as the input, it asks to find a subtree $t$ in $T$ such that
$P$ is homeomorphic to $t$. Here, two trees $T_1, T_2$ are said to be homeomorphic if one can remove degree-2
vertices from $T_1, T_2$ such that the resulting trees are isomorphic. Another recent work [13] considers homeomorphic
alignment of "weighted" but unlabeled trees. Here the goal is to obtain a homeomorphic mapping between
vertices of two trees such that the differences between the weights of "aligned" edges are minimized. While
related to our approach, the method described in [13] is not applicable to our problem as the trees it
considers are not leaf labeled. We refer the reader to [1] for a general and gentle overview of further related
work on tree edit distance, tree alignment and tree inclusion.

Summary of Contributions. The main technical contribution of this paper is a novel deterministic
mirrortree algorithm that directly compares tree topologies. The algorithm is optimal within the single
constraint we impose and is provably efficient. We compare our algorithm with the most recent, state-of-the-art
heuristic search approach [4] that aims to maximize the similarity between distance matrices, where
distances reflect lengths of shortest paths in neighbor-joining trees. In our comparisons we use precisely the
same trees to be able to juxtapose a distance matrix-based heuristic search method with our topology-based,
deterministic method without introducing further biases. Our main conclusions are as follows.

• We can compute mappings for the 488 interacting domain families in roughly 1 minute on a single
CPU, in comparison to the 730 hours on MareNostrum^2 needed for the Metropolis search performed by [4].

• We outperform the Metropolis search in terms of precision, i.e., the percentage of correctly inferred
pairings among all inferred pairings is higher in our approach (48%) than in [4] (43%).

• In terms of F-measure (see the Discussion section for a definition), which has been most recently
suggested for assessing mirrortree approaches in terms of both recall and precision [17]^3, our
topology-based approach again prevails (0.47 over 0.45).

2 Preliminaries and notations 

Let $T = (V, E, w)$ be a tree with weighted edges as given by a non-negative weight function $w : E \to \mathbb{R}_{+}$.
We denote the leaves of $T$ by $L = \{\ell_1, \ldots, \ell_n\}$, the internal nodes of $T$ (excluding the root) by
$U = \{u_1, \ldots, u_m\}$, and the root of $T$ by $r$. In particular, let $n$ be the number of leaves and $m$ be the number of
internal vertices without the root. Note that a tree $T$ is binary and rooted if and only if $\deg(r) = 2$ and
$\deg(u) = 3$ for all internal vertices $u \in U$; this implies that $m = n - 2$ and $|E| = 2n - 2$. In our
setting, edge weights $w(v_i, v_j)$ reflect the evolutionary distance between adjacent vertices $v_i, v_j$. Note that
leaves refer to gene products whereas internal vertices can be interpreted as speciation and/or duplication
events. For a given vertex $v \in V$, we define $\theta(v)$ as the evolutionary distance between the root and $v$. In
other words, $\theta(v)$ is the sum of the edge weights on the unique path from the root to $v$. In rooted trees, there
is a natural partial order

$$v_i < v_j \iff v_i \text{ is an ancestor of } v_j \qquad (3)$$

on the vertices of $T$. Hence, the edges have a natural orientation and each vertex $v_i$ induces a unique
subtree $T(v_i)$. This partial order is crucial for our algorithm, which cannot be applied to unrooted trees
in a straightforward manner. For processing unrooted (e.g. neighbor-joining) trees, consider the pair of
proteins/domains (one from each tree) which are known to interact. We root the two trees at these vertices
in order to apply our algorithm. Provided such a pair exists (which is typically the case), our algorithm
optimally aligns the trees as it does not assume any order among sibling vertices. In a tree $T$
which is rooted at $r$, we call vertex $u$ the parent of a vertex $v$ if $u$ and $v$ are connected by an edge and $u$
is closer to $r$ than $v$. The height of a rooted tree is defined as $\max\{d(r, \ell_i) \mid i = 1, \ldots, n\}$, where $d(v_1, v_2)$
is the length of the shortest path between vertices $v_1$ and $v_2$ without considering edge weights; that is, the
height is the maximum (unweighted) distance from the root to a leaf.

3 Methods 

^2 MareNostrum is a supercomputer of the Barcelona Supercomputing Center, one of the largest machines in the world dedicated
to science [4, p. 10].

^3 Note that [17] suggest $F_{0.1}$, which favors the topology-based approach even more than $F_{0.25}$, which we use.

Figure 1: Two isomorphic trees, rooted at Root 1 and Root 2. The leaves of the left tree are labeled
$a_1, a_2, a_3, a_4$, while the leaves of the tree on the right are labeled $b_1, b_2, b_3, b_4$. A possible mapping
between the leaves that respects the tree topology is $(a_1, b_3), (a_2, b_4), (a_3, b_2), (a_4, b_1)$.

Given two rooted weighted-edge trees $T$ and $T'$, our algorithm aligns the trees by mapping a subset of leaves
of $T$ to a subset of leaves of $T'$. In order to obtain this mapping, a series of (1) individual vertex deletions
or (2) subtree deletions (with specific penalties) is performed on each tree with the goal of obtaining two
isomorphic trees $T_1 = (V_1, E_1, w_1)$ (from $T$) and $T'_1 = (V'_1, E'_1, w'_1)$ (from $T'$). Figure 1 shows two such
rooted trees that are isomorphic; it also shows a mapping between the leaves. The specifics of vertex and
subtree deletions on a tree $T = (V, E, w)$ are as follows.

1. Deleting an internal vertex $v$ also deletes the edge $(u, v)$, where $u$ is the parent of $v$. Furthermore, it
connects each child $x$ of $v$ to $u$ by deleting the edge $(v, x)$ and creating a new edge $(u, x)$. The weight
of this new edge, $w(u, x)$, is set to $w(u, v) + w(v, x)$. As mentioned earlier, it is not possible to delete
both a node $v$ and its parent $u$ from $T$.

2. Deleting an entire subtree rooted at an internal vertex $v$ deletes all descendants of $v$ and their associated
edges.
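
The two deletion operations can be illustrated with a minimal Python sketch. The `Node` class and the function names are hypothetical and are not taken from the C implementation; the sketch also assumes that deleting a subtree detaches its root vertex together with all of its descendants.

```python
class Node:
    """Rooted tree vertex; `weight` is the length of the edge to its parent."""

    def __init__(self, name, weight=0.0):
        self.name = name
        self.weight = weight
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def delete_internal_vertex(v):
    """Contract the edge (u, v), where u is the parent of v.

    Each child x of v is reattached to u, and the new edge (u, x)
    receives the weight w(u, v) + w(v, x).
    """
    u = v.parent
    if u is None or not v.children:
        raise ValueError("v must be an internal, non-root vertex")
    pos = u.children.index(v)
    for x in v.children:
        x.weight += v.weight
        x.parent = u
    # Put the former children of v where v used to be in u's child list.
    u.children[pos:pos + 1] = v.children
    v.parent, v.children = None, []


def delete_subtree(v):
    """Detach v (assumption: together with all of its descendants) from its parent."""
    u = v.parent
    if u is None:
        raise ValueError("cannot delete the subtree rooted at the root")
    u.children.remove(v)
    v.parent = None
```

Note that the sketch does not enforce the constraint that a vertex and its parent may not both be deleted; a caller would have to check this separately.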

In the remainder of this section, we will discuss the costs of the above deletion operations and the scores
of the mapped vertices. As mentioned earlier, the overall score of the mapping will be the sum of the scores
of the mapped vertices and the scores (negative costs) of the deletion operations.

3.1 Scoring Scheme 

Let $T_1$ and $T'_1$ be the isomorphic trees which result from performing a series of deletion operations on $T$ and
$T'$. The isomorphism $\Phi : T_1 \to T'_1$ implies a mapping (a.k.a. alignment) $\mathcal{M}[T, T']$ between the original
trees $T, T'$. Let $L_1, L'_1$ denote the sets of leaves that are mapped in $T$ and $T'$ respectively; because the
mapping is a bijection, we must have $|L_1| = |L'_1|$. We write $SP := \{(\ell, \ell') \mid \ell \in L, \ell' \in L', (\ell, \ell') \in \mathcal{M}\} \subset
\mathcal{M}[T, T']$ for the set of mapped pairs of leaves (we require that mapped leaves have identical labels, hence the naming
SP for 'species').

Recall that a mapping of two trees may involve deleting internal vertices or entire subtrees. We now
distinguish between two types of internal vertex deletions, a.k.a. edge contractions.

1. [Isolated Deletion:] deletion of only one child $v$ of a vertex $u$. Let further $x_1, x_2$ be the two children
of $v$. Isolated deletion of $v$ also implies deleting the edges $(u, v)$, $(v, x_1)$, $(v, x_2)$ and creating the new
edges $(u, x_1)$, $(u, x_2)$.

2. [Parallel Deletion:] deletion of both children (say $x$ and $y$) of a vertex $v$. This implies deletion and
creation of edges in a fashion analogous to that for isolated deletion.

Accordingly, we further distinguish between isolated deleted vertices $N_{I,iso}$ and vertices which were
deleted in parallel $N_{I,par}$, such that $N_I = N_{I,iso} \cup N_{I,par}$. For a given mapping $\mathcal{M}[T, T']$ let
$E_{iso}(\mathcal{M}) := \{(u, v) \mid v \in N_{I,iso}\}$ be the set of edges which join isolated deleted vertices with their parents. Analogously,
$E_{par}(\mathcal{M})$ is the set of edges that join deleted siblings with their parent. See Figure 2 for examples of isolated
and parallel deletions.

Given a pair of mapped leaves $(\ell_1, \ell_2) \in SP$, their alignment score $\kappa(\ell_1, \ell_2)$ is defined as

$$\kappa(\ell_1, \ell_2) = C - |\theta(\ell_1) - \theta(\ell_2)|$$

where $C$ is a positive constant that provides a positive contribution to the overall score for aligning
two leaves with the same label, while the difference between the distances of $\ell_1$ and $\ell_2$ from
their respective roots is subtracted to penalize the alignment of two leaves which differ topologically.
The total score $S$ of an alignment $\mathcal{M}[T, T']$ as per the above definition is fully specified by

$$S(\mathcal{M}[T, T']) = \sum_{(\ell_1, \ell_2) \in SP} \kappa(\ell_1, \ell_2) - \sum_{e_s \in E_{iso}(\mathcal{M})} E \cdot w(e_s) - \sum_{e_p \in E_{par}(\mathcal{M})} F \cdot w(e_p) \qquad (4)$$

where, with respect to the formulation in (2), the first sum corresponds to $\sum_{\mathcal{M}} S_M(v, v')$, the two penalty
sums correspond to $\sum_{v \in N_I} S_{N_I}(v)$, and $\sum_{v \in N_T} S_{N_T}(v)$ is zero. $E$ and $F$ are user-defined constants that respectively
penalize isolated deletion and parallel deletion of edges. Note that this penalty is proportional to the length
of the edges joining the deleted vertices with their parents: deletion of longer edges leads to a more severe
perturbation of topology and hence is more severely penalized. We set the cost of deleting a subtree (i.e. $S_{N_T}$)
to 0. Note, however, that deleting subtrees is implicitly penalized by disregarding any potentially good mappings
of leaves in them.
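
As a minimal sketch of how the total score is assembled, the following Python functions evaluate the leaf score and the two edge penalties for an already given mapping. The function names and the input representation (lists of root distances and of edge weights) are assumptions made for illustration; the default constants are the values C = 1, E = 2, F = 50 reported in the Results section.

```python
def kappa(theta1, theta2, C=1.0):
    """Score of one mapped leaf pair: C minus the difference of root distances."""
    return C - abs(theta1 - theta2)


def total_score(mapped_leaf_depths, iso_edge_weights, par_edge_weights,
                C=1.0, E=2.0, F=50.0):
    """Total alignment score of a given mapping.

    mapped_leaf_depths: (theta(l1), theta(l2)) for every mapped leaf pair
    iso_edge_weights:   weights of edges removed by isolated deletions
    par_edge_weights:   weights of edges removed by parallel deletions
    Deleted subtrees contribute nothing, since their cost is set to 0.
    """
    gain = sum(kappa(t1, t2, C) for t1, t2 in mapped_leaf_depths)
    iso_penalty = E * sum(iso_edge_weights)
    par_penalty = F * sum(par_edge_weights)
    return gain - iso_penalty - par_penalty
```

For instance, two mapped leaf pairs with root distances (1.0, 1.5) and (2.0, 2.0) and one isolated deletion of an edge of weight 0.25 give (0.5 + 1.0) - 2 * 0.25 = 1.0 under the default constants.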

Given the above score function, the gene tree alignment problem can be formally stated as follows. 



Gene Tree Alignment Problem

Given two rooted weighted-edge trees $T, T'$, determine subsets of leaves $L_1 \subset L$, $L'_1 \subset L'$ of equal
size such that the corresponding subtrees can be transformed by isolated and parallel edge contraction
and subtree removal operations into trees $T_1, T'_1$ for which there is an isomorphism $\Phi : T_1 \to T'_1$ that
maximizes $S(\mathcal{M}[T, T'])$.





Figure 2: A gene tree (a) without contracted edges, (b) with an isolated contraction of one edge, and
(c) with a parallel contraction of the two edges $(A_7, A_5)$ and $(A_7, A_6)$.



3.2 A Dynamic Programming Solution 

The gene tree alignment problem can be efficiently solved by a dynamic programming algorithm. Our
algorithm runs in $O(|V| \cdot |V'|)$ time for two binary, rooted trees $T, T'$ with vertex sets $V, V'$. In general,
our strategy can be applied to arbitrary rooted trees with bounded maximum degree $\Delta_{\max}$. Note that when
internal vertices may be deleted (i.e. edges contracted), the number of children of an internal vertex
is still bounded by a constant ($\leq 4$).
Initialization. As a first step, we remove all leaves that refer to species that occur in only one of the two trees. Let
$n = |V|$ and $n' = |V'|$. For every pair of vertices $v_i \in V$ and $v'_j \in V'$ (i.e. for every $i = 1, \ldots, n$ and
$j = 1, \ldots, n'$), we compute the maximum alignment score for the subtrees rooted at $v_i$ from $T$ (i.e. $T(v_i)$)
and $v'_j$ from $T'$ (i.e. $T'(v'_j)$). We denote the maximum alignment score for $T(v_i)$ and $T'(v'_j)$ by $S_{i,j}$.

In our dynamic programming algorithm, we handle the "base" cases, where one (or both) of $T(v_i)$ or
$T'(v'_j)$ have 3 or fewer leaves, as follows.

• If both $v_i \in V$ and $v'_j \in V'$ are leaves, then by definition, $S_{i,j} = \kappa(v_i, v'_j)$.

• Without loss of generality, if $v_i$ is a leaf and $v'_j$ is an internal vertex, $S_{i,j} = \max(S_{i,j_1}, S_{i,j_2})$, where $j_1$
and $j_2$ correspond to the children of $v'_j$.

• The remainder of the base cases have both $v_i$ and $v'_j$ as internal vertices and are solved through
exhaustive evaluation of all possible alignments.

Recursion. For pairs of internal vertices, each with at least 4 descendants, $S_{i,j}$ will be computed through recurrence
equations. These equations are based on the alignment scores between subtrees rooted at the children (or
grandchildren) of $v_i$ and $v'_j$. Let $i_1$ ($j_1$) and $i_2$ ($j_2$) be the children of the vertex $v_i$ ($v'_j$). Also, let $i_{11}, i_{12}$ be
the children of $i_1$, and $i_{21}, i_{22}$ be the children of $i_2$. Similarly, let $j_{11}, j_{12}$ be the children of $j_1$, and $j_{21},
j_{22}$ be the children of $j_2$. We first give a high-level description of the recurrence equation. Suppose that
the maximum alignment score between any subtree in $T(v_i)$ and any subtree in $T'(v'_j)$ has already been
computed. In order to compute the alignment score $S_{i,j}$, we consider several cases: we can either delete one
or both subtrees rooted at the children of $v_i$ and $v'_j$ (deleting an entire subtree) or align the subtrees rooted at
the children of $v_i$ and $v'_j$ to each other. We can also delete one of the children of $v_i$ (either $i_1$ or $i_2$) together
with one of the children of $v'_j$ (either $j_1$ or $j_2$) and align the three resulting subtrees in $T(v_i)$ to a permutation
of the ones in $T'(v'_j)$.^4 Finally, we have to consider the case where both children of the root (i.e. $i_1$ and
$i_2$ in $T(v_i)$, and $j_1$ and $j_2$ in $T'(v'_j)$) are deleted. In this case we align the four subtrees in $T(v_i)$ (rooted at
$i_{11}, i_{12}, i_{21}, i_{22}$) to a permutation of the four resulting subtrees in $T'(v'_j)$. The optimal alignment score
$S_{i,j}$ will thus be the maximum alignment score provided by all of the cases above.

Let $e(v)$ denote the penalty for isolated deletion of an internal vertex $v$, which is the product of the
constant $E$ and the weight of the edge between $v$ and its parent (see Scoring Scheme section). Also, let
$f(v)$ denote the penalty for parallel deletion of both children of an internal vertex $v$; $f(v)$ is defined as the
constant $F$ times the total weight of the edges that connect $v$ to its children. The recurrence equation for $S_{i,j}$
thus becomes the following

^4 We have to consider all the permutations because the trees are unordered (i.e. the order of siblings of an internal vertex is
unimportant).

(5)



where the permutation $\pi = \pi_1\pi_2\pi_3\pi_4$ ranges over all permutations of $\{j_{11}, j_{12}, j_{21}, j_{22}\}$. Note that some
cases are redundant but are still represented here for the sake of clarity.
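
To make the structure of the dynamic program concrete, the following Python sketch implements a deliberately restricted variant: only subtree deletions (at zero cost) and child-to-child alignments are considered, and edge contractions are left out entirely. It is therefore not the full recurrence (5). The `Node` class, the function names and the choice to score a pair of leaves with different labels as 0 (both leaves deleted) are assumptions made for illustration, and both trees are assumed to be binary.

```python
class Node:
    """Vertex of a rooted binary tree; leaves carry a species label."""

    def __init__(self, label=None, children=(), weight=0.0):
        self.label = label
        self.children = list(children)
        self.weight = weight      # weight of the edge to the parent
        self.theta = 0.0          # weighted distance from the root


def set_theta(node, dist=0.0):
    """Annotate every vertex with its weighted distance from the root."""
    node.theta = dist
    for child in node.children:
        set_theta(child, dist + child.weight)


def align_score(root_a, root_b, C=1.0):
    """Maximum score when only subtree deletions and child alignments are allowed."""
    set_theta(root_a)
    set_theta(root_b)
    memo = {}

    def S(a, b):
        key = (id(a), id(b))
        if key in memo:
            return memo[key]
        if not a.children and not b.children:
            # Leaf vs. leaf: kappa if the labels agree, otherwise nothing is mapped.
            val = C - abs(a.theta - b.theta) if a.label == b.label else 0.0
        elif not a.children:
            val = max(S(a, c) for c in b.children)
        elif not b.children:
            val = max(S(c, b) for c in a.children)
        else:
            a1, a2 = a.children
            b1, b2 = b.children
            val = max(
                S(a1, b), S(a2, b),          # delete one child subtree of a
                S(a, b1), S(a, b2),          # delete one child subtree of b
                S(a1, b1) + S(a2, b2),       # align children in order
                S(a1, b2) + S(a2, b1),       # align children swapped
            )
        memo[key] = val
        return val

    return S(root_a, root_b)
```

Each pair of vertices is evaluated once thanks to the memo table, and every case inspects a constant number of previously computed entries, which is in line with the quadratic running time stated above for the full algorithm.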

Now, given $r$ and $r'$, the roots of $T$ and $T'$, respectively, the alignment score $S_{r,r'}$ (i.e. the maximum
alignment score of the rooted trees) can be computed using the above recurrence equation, providing a
solution to the gene tree alignment problem. It is quite straightforward to prove that our algorithm correctly
computes the maximum alignment score through a (strong) induction on the sum of the heights of the
rooted trees. Note that the scores of internal vertex alignments can be computed through the scores of the
alignments between their (grand)children, and the recurrence precisely serves to satisfy the constraints. The
base of the induction is trivial: if the minimum height of the trees is zero (i.e. one of the trees is just a single
leaf), the optimal value of the alignment can be found using the definitions and simple case analysis. Given
the subtrees $T(v_i)$ and $T'(v'_j)$, with heights $h$ and $h'$, respectively, we assume the induction hypothesis that
the maximum alignment score has been computed correctly for all pairs of subtrees $T(v_p)$ and $T'(v'_q)$ with
heights $h_p$ and $h_q$ such that $h_p + h_q < h + h'$. It is easy to verify by case analysis that all cases in the
recurrence equation reduce to a case in which the sum of the heights of the aligned (grand)children is smaller.
4 Results



Data Source and Alternative Methods. We benchmarked our algorithm against the most recent heuristic
search method [4] for determining a mapping in the presence of paralogs on the large-scale data corpus
described in the same study. This data set contains multiple alignments for 604 yeast protein domains,
among which 488 domain pairs are known to co-occur in the same protein. Those 488 domain family pairs
are considered to be a particularly tough test [4] due to the presence of approximately 6 paralogs per species
on average. For all interacting domain family pairs, neighbor-joining trees were computed using ClustalW
[15], and the trees were rooted at the domains which are known to interact.



Evaluation Criteria. Following [4], we determine the maximum number of protein domains that can be
paired without topology constraints; i.e. if we have $k_1$ paralogs of a particular protein domain $d_1$ and $k_2$
paralogs of domain $d_2$ within the same species, then this species contributes $\min(k_1, k_2)$ to the overall
count. By the usual conventions, we denote this value as $P$. Among the $P$ potentially correctly paired protein
domains, the number of those which have been inferred by the algorithm in use, such that both domains
reside in the same protein, is referred to as "true positives", $TP$. Similarly, the number of computed protein domain
pairings which do not reside in the same protein is referred to as "false positives", $FP$. Recall
(Sensitivity) is defined as $Rec = TP/P$ and Precision (Positive Prediction Rate) is defined as
$Prec = TP/(TP + FP)$, while the F-measure is determined as

$$F_{0.25} = \frac{(1 + 0.25^2) \cdot Rec \cdot Prec}{0.25^2 \cdot Prec + Rec}.$$

Note that Recall is referred to as Accuracy in [4]. We determine Precision, Recall and $F_{0.25}$ for each pair
of trees individually. Values displayed in Tables 1, 2 and 3 are average values over all 488 co-evolving tree
pairs.
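
The evaluation measures can be written down directly. The following Python sketch uses hypothetical function names and returns 0 for the F-measure when both recall and precision are 0, which is an assumption made here to avoid a division by zero.

```python
def recall(tp, p):
    """Rec = TP / P, with P the maximum number of pairable domains."""
    return tp / p


def precision(tp, fp):
    """Prec = TP / (TP + FP)."""
    return tp / (tp + fp)


def f_measure(rec, prec, beta=0.25):
    """F_beta; beta = 0.25 weights precision more heavily than recall."""
    b2 = beta * beta
    denom = b2 * prec + rec
    if denom == 0:
        return 0.0  # assumption: define F as 0 when Rec = Prec = 0
    return (1 + b2) * rec * prec / denom
```

For a single tree pair with TP = 3, FP = 1 and P = 6 this gives Rec = 0.5, Prec = 0.75 and an F-measure of 51/70, i.e. about 0.729, which lies much closer to the precision than to the recall.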



Tree Constraints. In order to appropriately assess the contribution of the different tree constraints as
outlined in the Methods section, we evaluated our algorithm by not allowing edge contraction ($C_0$: $C =
1, E = \infty, F = \infty$ in (4)), by allowing edge contraction (without penalty, that is $E = 0$ in (4)) up to creating
ternary internal vertices ($C_1$: $C = 1, E = 0, F = \infty$ in (4)), as well as by further allowing creation of
quaternary vertices through parallel contraction of two edges, see Fig. 2(c) for an example ($C_{1,2}$: $C =
1, E = F = 0$ in (4)). We also ran a test case $C_{ser}$ where we additionally allow for deletion of vertices in a parent-child relationship
(= serial)^6 without penalizing any sort of deletion. We achieved best results in the case of $C_{1,2}$ and further
determined that considerably penalizing parallel contraction, in contrast to imposing only a relatively mild
penalty for isolated contraction, yielded an optimal choice of parameters $C = 1, E = 2, F = 50$ (referred to as
$C_{opt}$). We suggest ratios $C/E = 1/2$, $E/F = 1/25$ as default settings. However, determination of absolute
values needs to be put into context with the orders of magnitude of the edge weights of the trees under consideration.

As outlined in the Methods section, imposing tree constraints considerably reduces the search space,
thereby allowing for an efficient and deterministic method. To also highlight these effects, we further
determine the size of the largest correct mapping which does not violate the tree constraints, $CP$ ("Constraint
Positives"). We compute $RP = CP/P$ ("Relative Positives") as the fraction of correct pairings that can still be
inferred, a value which reflects how the reduction of search space influences the number of correct
pairings. We further compute $RelRec = TP/CP$ ("Relative Recall") as a recall value which reflects how many
of the correct pairings possible were inferred by the algorithm in question. Note that the heuristic search
does not impose any constraints on the search space, hence $CP = P$, such that Recall and Relative
Recall coincide. Juxtaposing $RP$ and $RelRec$ values is meant to put usage of tree topology into a general
perspective. Moreover, $RelRec$ values shed light on the effectiveness of the search strategy in use.
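
A short Python sketch of the two relative measures follows, assuming they are the ratios RP = CP/P and RelRec = TP/CP, which is how the verbal definitions above read. The function names are hypothetical.

```python
def relative_positives(cp, p):
    """RP = CP / P: share of correct pairings still reachable under the tree constraints."""
    return cp / p


def relative_recall(tp, cp):
    """RelRec = TP / CP: share of the reachable correct pairings that were inferred."""
    return tp / cp
```

Under these definitions, recall for a single tree pair factors as Rec = RP · RelRec. For example, P = 10, CP = 6 and TP = 3 give RP = 0.6, RelRec = 0.5 and Rec = 0.3. The factorization holds per tree pair and need not hold for values averaged over many pairs.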



^5 Note that $F_{0.1}$, which has been recently suggested as an appropriate prediction quality measure for mirrortree approaches [17],
yields results which are even more in favor of our approach.

^6 Thereby we do not allow for deletion of grandparent-parent-child relationships, which preserves an efficient recurrence scheme.

Table 1 presents numbers for all 488 tree pairs. Following [4], we also separate tree pairs according to the
number of leaves of the larger tree (see Table 2) and the product of the numbers of leaves of the two paired
trees (see Table 3) which, according to [4], quantifies search space size. Optimal values are highlighted in
all categories.



Method   | RP    | Recall | RelRec | Precis | F0.25
C_0      | 0.546 | 0.330  | 0.557  | 0.447  | 0.438
C_1      | 0.610 | 0.378  | 0.586  | 0.475  | 0.468
C_{1,2}  | 0.612 | 0.377  | 0.581  | 0.471  | 0.464
C_ser    | 0.638 | 0.373  | 0.556  | 0.444  | 0.439
C_opt    | 0.612 | 0.380  | 0.588  | 0.479  | 0.472
Heur.    | 1.000 | 0.550  | 0.550  | 0.450  | 0.450

Table 1: Evaluation of our method with different choices of parameters and the previously published
heuristic approach [4] (= Heur., values have been rounded to the order of $10^{-2}$). Baseline values for Recall and
Precision are $1/6 \approx 0.17$.









MaxSize  | <120                  | >120
Meth.    | Rec   | Prec  | F0.25 | Rec   | Prec  | F0.25
C_0      | 0.389 | 0.511 | 0.502 | 0.200 | 0.305 | 0.296
C_1      | 0.436 | 0.537 | 0.530 | 0.249 | 0.338 | 0.331
C_{1,2}  | 0.436 | 0.534 | 0.527 | 0.245 | 0.331 | 0.527
C_ser    | 0.437 | 0.525 | 0.519 | 0.231 | 0.262 | 0.260
C_opt    | 0.439 | 0.541 | 0.534 | 0.251 | 0.340 | 0.333
Heur.    | 0.700 | 0.540 | 0.550 | 0.340 | 0.280 | 0.280

Table 2: The comparison of our method with the heuristic search method shows favorable results for large
trees (>120 leaves) for our method. For Heur., values have been rounded to the order of $10^{-2}$.



5 Discussion 

Runtime. Possibly the most striking advantage of the topology-based approach is the drastic reduction of
runtime: we can align all trees in approximately 1 minute on a single-processor laptop instead of 730 hours on a
supercomputer. Note that there are rapidly growing large-scale phylogenetic databases such as ENSEMBL [2] or
PhylomeDB [10], whose growth is further accelerated by next-generation sequencing projects (as of 12th
August, 2011, PhylomeDB contains 482,274 phylogenetic trees). The reduction in runtime delivered by
our approach overcomes a major obstacle: we render large-scale mapping and, as a consequence,
comparison of paralog-rich gene trees feasible. Note that this reduction has become possible by imposing
both computationally and biologically reasonable constraints on the search space while at the same time
allowing for an efficient scheme to find the global optimum within these constraints.

Search Space Size / Recall. Comparing Copt with the method of [4] (Heuristic) overall, [4] clearly
achieves the best recall. As pointed out above, this comes as no surprise since we cannot explore pairings that
          Space <11680                Space >11680
Meth.     Rec     Prec    F0.25      Rec     Prec    F0.25
C0        0.361   0.478   0.469      0.191   0.305   0.295
C1        0.409   0.506   0.499      0.239   0.336   0.328
C{1,2}    0.407   0.501   0.494      0.239   0.334   0.326
Cser      0.410   0.489   0.483      0.205   0.238   0.236
Copt      0.410   0.508   0.501      0.245   0.343   0.335
Heur.     0.640   0.500   0.510      0.280   0.20    0.20



Table 3: The comparison of our method with the heuristic search method reveals favorable results for large
search spaces (Space > 11680). For Heur., values have been rounded to the order of 10^-2.



contradict the topologies of the paired trees. Quite surprisingly, although the usage of tree topology in
general, and of neighbor-joining trees in particular, has been discussed rather controversially [18], we find
that the majority of pairings (54.6% with the strictest constraints and 61.2% when allowing isolated and
parallel deletion) can still be determined by a topology-based approach. These numbers may put the usage of
neighbor-joining tree topology in mirrortree approaches into a general perspective. Moreover, note that the
fraction of correct domain pairs computed by our method over that of the heuristic search method is about 0.7
($= \frac{\mathrm{TP}(C_{opt})}{\mathrm{TP}(\mathrm{Heuristic})} = \frac{\mathrm{Recall}(C_{opt})}{\mathrm{Recall}(\mathrm{Heuristic})} = \frac{0.38}{0.55}$),
which is more than what was to be expected from the reduction of the search space
($\frac{\mathrm{RP}(C_{opt})}{\mathrm{RP}(\mathrm{Heuristic})} = \mathrm{RP}(C_{opt}) = 0.61$). This indicates that we compensate for the
search space reduction by a more effective search strategy, which is reflected in the better RelRec
values of Copt.
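
The comparison of the two ratios can be retraced with the values reported in Table 1. The following is a minimal sketch: the dictionary layout and variable names are ours, and the Heur. values are the rounded ones given in the table, so the result is only accurate to about two digits.

```python
# Values copied from Table 1 (Heur. values are rounded in the source).
recall = {"Copt": 0.380, "Heur": 0.550}
rp = {"Copt": 0.612, "Heur": 1.000}

# Fraction of the heuristic's correct pairs that Copt recovers.
# Both methods are scored against the same set of true pairs, so the
# ratio of true positives equals the ratio of recall values.
observed_ratio = recall["Copt"] / recall["Heur"]

# Fraction one would expect if the loss came from the smaller search space alone.
expected_ratio = rp["Copt"] / rp["Heur"]

print(f"observed: {observed_ratio:.2f}")  # observed: 0.69
print(f"expected: {expected_ratio:.2f}")  # expected: 0.61
```

The observed ratio exceeds the expected one, which is the basis for the claim that the search strategy within the restricted space is more effective.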



Precision and F-Measure. Precision also favors the topology-based approach, at least on larger
(combinations of) trees (see column Prec in all three tables). Better precision reflects a larger fraction of
correct domain pairs among the pairs inferred overall, and [11] argue in a recent contribution that
precision is more relevant than recall in mirrortree approaches. Consequently, they suggest the F-measure
$F_{0.1} = \frac{(1 + 0.1^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{0.1^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$
as a measure of overall performance. We slightly take issue with this suggestion, as we feel that $F_{0.1}$
overrates Precision, and instead suggest the more balanced
$F_{0.25} = \frac{(1 + 0.25^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{0.25^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$.
We achieve better values in terms of $F_{0.25}$ than [4] on pairs of larger trees. See Prec and F0.25 in
Tables 1, 2 (in particular > 120) and 3 (in particular > 11680) for related results.
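
To make the effect of the weighting concrete, the general F-measure can be evaluated on the precision and recall values of Table 1. The sketch below assumes the standard definition of the F-measure with weight beta. The function name `f_beta` is ours and is not taken from the published implementation.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted F-measure; beta < 1 puts more weight on precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Convention: the score is 0 when both precision and recall are 0.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall of C0 and Copt as listed in Table 1.
print(round(f_beta(0.447, 0.330, 0.25), 3))  # 0.438
print(round(f_beta(0.479, 0.380, 0.25), 3))  # 0.472

# With beta = 0.1 the score moves closer to the precision value (0.479).
print(round(f_beta(0.479, 0.380, 0.1), 3))   # 0.478
```

The first two results agree with the F0.25 column of Table 1. The third illustrates why a weight of 0.1 is close to reporting precision alone.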



Conclusion. In summary, we have, for the first time, devised a deterministic and efficient,
polynomial-runtime mirrortree approach which directly compares the gene trees, and not the distance matrices
behind or giving rise to them. We have juxtaposed our approach with the most recent, state-of-the-art
matrix-based heuristic search procedure without introducing further experimental biases. The benefits of our
tree topology-based algorithm are efficiency and precision: its runtime is better by several orders of
magnitude, dropping from several hundreds of hours to only one minute when mirroring ≈ 500 trees. Recall is
better for the heuristic search, which is explained by the fact that its search strategy does not impose any
constraints on the search space. Our advantages become most obvious for large trees and in particular when
both of the mirrored trees are not small. Here, our algorithm also achieves comparable recall values while
our advantages in precision become distinct. This leads us to conclude that the heuristic method remains
the better choice for smaller trees and when runtime is not an issue. In the case of larger trees, and in
particular for large-scale studies, our approach has considerable benefits. Note finally that we have been
comparing neighbor-joining trees, which have been repeatedly exposed as suboptimal choices of phylogenetic
trees. We believe that our approach can gain significantly more from improvements in tree quality than the
matrix-based approaches.